[57] ABSTRACT

To support load instructions which execute out-of-order with 
respect to store instructions, a mechanism is implemented to 
detect (and correct) the occurrences where a load instruction 
executed prior to a logically prior store instruction, and 
where the load instruction received data for the location 
prior to being modified by the store instruction, and the 
correct data for the load instruction included bytes from the 
store instruction. Additionally, to execute store instructions 
out-of-order with respect to load instructions, a mechanism 
is implemented to keep a store instruction from destroying 
data that will be used by a logically earlier load instruction. 
Further, to support load instructions that are executed out-of-order
with respect to each other, a mechanism is implemented to ensure
that any pair of load instructions (which
access at least one byte in common) return data consistent 
with executing the load instructions in order. 

1 Claim, 6 Drawing Sheets 
SUPPORT FOR OUT-OF-ORDER
EXECUTION OF LOADS AND STORES IN A 
PROCESSOR 

CROSS-REFERENCE TO RELATED PATENT 
APPLICATIONS 

The present application is related to the following appli- 
cations: 

"METHOD FOR FAST UNIFIED INTERRUPT AND 
BRANCH RECOVERY SUPPORTING FULL OUT-OF- 
ORDER EXECUTION", U.S. patent application Ser. No. 
08/829,662, which is hereby incorporated by reference 
herein; 

"FORWARDING OF RESULTS OF STORE 
INSTRUCTIONS," U.S. patent application Ser. No. 08/826, 
854, which is hereby incorporated by reference herein; and 

"CHECKPOINT TABLE FOR SELECTIVE INSTRUC- 
TION FLUSHING IN A SPECULATIVE EXECUTION 
UNIT," U.S. patent application Ser. No. 08/934,960, which 
is hereby incorporated by reference herein. 

TECHNICAL FIELD 

The present invention relates in general to data processing 
systems, and in particular, to out-of-order execution of load 
and store instructions in a processor. 

BACKGROUND INFORMATION 

To achieve higher performance levels, processor and 
system designers attempt to increase processor and system 
clock rates and increase the amount of work done per clock 
period. Among other influences, striving for higher clock 
rates drives toward de-coupled designs and semi- 
autonomous units with minimal synchronization between 
units. Increased work per clock period is often achieved 
using additional functional units and attempting to fully 
exploit the available instruction-level parallelism. 

While compilers can attempt to expose the instruction- 
level parallelism which exists in a program, the combination 
of attempting to minimize path length and a finite number of 
architected registers often artificially inhibits a compiler 
from fully exposing the inherent parallelism of a program. 
There are many situations (such as the instruction sequence 
below) where register resources prevent a more optimal 
sequencing of instructions. 

FM FPR5←FPR4, FPR4

FMA FPR2←FPR3, FPR4, FPR5

FMA FPR4←FPR6, FPR7, FPR8

Here, given that most processors have multi-cycle floating 
point pipelines, the second instruction cannot execute until 
several cycles after the first instruction starts to execute. In 
this case, although the source registers of the third instruc- 
tion might be expected to be available and the third instruc- 
tion is expected to be ready to execute before the second, the 
compiler cannot interchange the two instructions without 
selecting a different register allocation (since the third 
instruction currently overwrites the FPR4 value used by 
instruction 2). Often, selecting a register allocation which 
would be more optimal for this pair of instructions would be 
in conflict with the optimal register allocation for another 
instruction pair in the program. 

The dynamic behavior of cache misses provides another 
example where out-of-order execution can exploit more 
instruction-level parallelism than possible in an in-order 
machine. 
In this example, on some iterations there will be a cache
miss for the first load; on other iterations there will be a 
cache miss for the second load. While there are logically two 
independent streams of computation, in an in-order 
processor, processing will halt shortly after a cache miss and 
it will not resume until the cache miss has been resolved. 

This example also shows a cascading effect of out-of- 
order execution; by allowing progress beyond a stalled 
instruction (in this example an instruction which is depen- 
dent on a load with a cache miss), subsequent cache misses 
can be detected and the associated miss penalty can be 
overlapped (at least partially) with the original miss. The 
likelihood of overlapping cache miss penalties for multiple 
misses grows with the ability to support out-of-order load/store
execution.

As clock rates go higher and higher, being able to overlap
the cache miss penalties with useful computation and other 
cache misses will be of growing importance. 
Many current processors extract much of the available
instruction-level parallelism by allowing out-of-order 
execution for all units except for the load/store unit. Mechanisms
to support out-of-order execution for non-load/non-store
units are well understood; all potential conflicts between
two instructions can be detected by simply comparing the 
register fields specified statically in the instruction. 
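
As a rough illustration of the register-field comparison described above, the following sketch checks two decoded instructions for a conflict using only their statically specified registers. The `Instr` tuple and the `conflicts` function are hypothetical names introduced for this example, and the three instructions are the FM/FMA sequence given earlier.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    """Hypothetical decoded instruction: one target and its source registers."""
    target: str
    sources: Tuple[str, ...]


def conflicts(older: Instr, younger: Instr) -> bool:
    """Return True if swapping the two instructions could change results."""
    raw = older.target in younger.sources   # younger reads what older writes
    war = younger.target in older.sources   # younger overwrites an older source
    waw = older.target == younger.target    # both write the same register
    return raw or war or waw


fm = Instr("FPR5", ("FPR4", "FPR4"))
fma2 = Instr("FPR2", ("FPR3", "FPR4", "FPR5"))
fma3 = Instr("FPR4", ("FPR6", "FPR7", "FPR8"))
```

Under these assumptions, `conflicts(fma2, fma3)` is true because the third instruction overwrites FPR4, which the second instruction reads. No address computation is needed, which is what separates this case from storage references.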

Out-of-order execution of storage reference instructions is
a considerably more difficult problem, as conflicts can arise
through storage locations, and the conflicts cannot be 
detected without the knowledge of the addresses being 
referenced. The generation of the effective/virtual address 
and the translations to a real address are normally performed 
as part of the execution of a storage reference instruction. 
Therefore, when a storage reference instruction is executed 
before a logically earlier instruction is executed, the address 
for the logically earlier instruction is not available for
comparison during the execution of the current instruction. 

To support loads which execute out of order with respect 
to stores, a mechanism is required to detect (and correct) the 
occurrences where a load executed prior to a logically prior 
store; where the load got the data for the location prior to
being modified by the store and the correct data for the load 
included bytes from the store operation. 

Similarly, to execute stores out of order with respect to 
loads, a mechanism is required to keep a store from destroy- 
ing data which will be used by a logically earlier load. 

Finally, to support loads that execute out of order with 
respect to each other, a mechanism is required to ensure that 
any pair of loads (which access at least one byte in common) 
return data consistent with executing the loads in order. This 
is an architectural requirement enforced by most, if not all,
multiprocessor ("MP") systems. 

SUMMARY OF THE INVENTION 

The foregoing needs are addressed by the present
invention, which discloses a processor capable of out-of- 
order load and store instructions, but provides several 
mechanisms for detecting certain situations where an out-of-order
load or store operation will result in invalid data
occurring. 

More specifically, the present invention provides a means 
for detecting the occurrence of the execution of a load
instruction ahead of a store instruction, where the load
instruction requires data resulting from the store instruction.
To accomplish this detection, a comparison is made between
the load instruction being executed and the store instructions
within a store address queue. If there are any common bytes
between the load and store instructions, and if the load
instruction is logically subsequent to the store instruction,
then the load instruction and all subsequent instructions are
flushed from the execution units.

Another occurrence that is detected by the present invention
is when a load instruction requires data resulting from
a store instruction, but the load instruction executes after the
store instruction has executed and the load operation
received data from the cache while the store operation was
still queued in the store address queue (i.e., prior to the store
operation updating the cache). To accomplish this detection,
the bytes and the program order tags associated with the
store instruction being executed and the load instructions
within a preload queue are compared. If there are any
common bytes, and a load instruction is logically subsequent
to the store instruction, then the load instruction and all 
subsequent instructions are flushed from the execution units. 
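
The comparison of an executing store against the preload queue might be sketched as follows. This is only an illustration under stated assumptions: the `Ref` tuple, the function names, and the convention that a larger tag means a logically later instruction are hypothetical, as is the choice to flush from the oldest conflicting load when several loads match.

```python
from typing import List, NamedTuple, Optional


class Ref(NamedTuple):
    """Hypothetical storage reference: program order tag, address, byte count."""
    tid: int
    address: int
    nbytes: int


def overlaps(a: Ref, b: Ref) -> bool:
    """True if the two references share at least one byte."""
    return a.address < b.address + b.nbytes and b.address < a.address + a.nbytes


def store_hits_preload(store: Ref, preload_queue: List[Ref]) -> Optional[int]:
    """Return the tag to flush from, or None if no queued load conflicts."""
    hits = [load.tid for load in preload_queue
            if load.tid > store.tid and overlaps(store, load)]
    # Assumption: flushing from the oldest conflicting load also removes
    # every younger conflicting load, so one flush point is enough.
    return min(hits) if hits else None


store = Ref(tid=10, address=0x100, nbytes=4)
queue = [Ref(12, 0x102, 4), Ref(11, 0x200, 4), Ref(14, 0x100, 1)]
flush_from = store_hits_preload(store, queue)
```

Here the loads tagged 12 and 14 both overlap the store and are logically subsequent to it, so the sketch reports a flush starting at tag 12.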

A third detection is accomplished by the present invention 
for the occurrence of out-of-order load instructions. If there
are any common bytes between a load instruction being 
executed and any load instruction within a load hit load 
queue, and the load instruction being executed is logically 
older than the load instruction within the load hit load queue, 
and the load instruction in the load hit load queue is beyond
the point where the load instructions can be reordered, then 
the logically younger load instruction and all subsequent 
instructions are flushed from the execution units. 

The foregoing has outlined rather broadly the features and 
technical advantages of the present invention in order that
the detailed description of the invention that follows may be 
better understood. Additional features and advantages of the 
invention will be described hereinafter which form the 
subject of the claims of the invention. 

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the present 
invention, and the advantages thereof, reference is now 
made to the following descriptions taken in conjunction with 
the accompanying drawings, in which:

FIG. 1 illustrates a data processing system configurable in 
accordance with the present invention; 

FIG. 2 illustrates a processor configured in accordance 
with the present invention;

FIG. 3 illustrates further detail of a load/store unit con- 
figured in accordance with the present invention; and 

FIGS. 4-6 illustrate processes for detecting out-of-order 
load and store operations. 

DETAILED DESCRIPTION

In the following description, numerous specific details are 
set forth such as specific word or byte lengths, etc. to provide 
a thorough understanding of the present invention. However, 
it will be obvious to those skilled in the art that the present
invention may be practiced without such specific details. In 
other instances, well-known circuits have been shown in 
block diagram form in order not to obscure the present
invention in unnecessary detail. For the most part, details 
concerning timing considerations and the like have been 
omitted inasmuch as such details are not necessary to obtain 
a complete understanding of the present invention and are 
within the skills of persons of ordinary skill in the relevant 
art.

Refer now to the drawings wherein depicted elements are 
not necessarily shown to scale and wherein like or similar 
elements are designated by the same reference numeral 
through the several views. 

Referring first to FIG. 1, an example is shown of a data 
processing system configurable in accordance with the 
present invention. The system has a central processing unit
("CPU") 210, such as a PowerPC microprocessor
("PowerPC" is a trademark of IBM Corporation) according
to the "PowerPC Architecture: A Specification for a New
Family of RISC Processors," 2d edition, 1994, Cathy May,
et al. Ed., which is hereby incorporated by reference herein.
A more specific implementation of a PowerPC microprocessor
is described in the "PowerPC 604 RISC Microprocessor
User's Manual," 1994, IBM Corporation, which is
hereby incorporated by reference herein. 

The CPU 210 is coupled to various other components by 
system bus 211. Read only memory ("ROM") 116 is coupled to
the system bus 211 and includes a basic input/output system 
("BIOS"), which controls certain basic functions of the data 
processing system. Random access memory ("RAM") 250, 
I/O adapter 118, and communications adapter 134 are also 
coupled to the system bus 211. I/O adapter 118 may be a 
small computer system interface ("SCSI") adapter that com- 
municates with a disk storage device 120 or tape storage 
drive 140. I/O adapter 118, disk storage device 120, and tape 
storage device 140 are also referred to herein as mass storage 
252. Communications adapter 134 interconnects bus 211 
with an outside network enabling the data processing system 
to communicate with other such systems. Input/output 
devices are also connected to system bus 211 via user 
interface adapter 122 and display adapter 136. Keyboard 
124, trackball 132, mouse 126, and speaker 128 are all 
interconnected to bus 211 via user interface adapter 122. 
Display monitor 138 is connected to system bus 211 by 
display adapter 136. In this manner, a user is capable of 
inputting to the system through the keyboard 124, trackball 
132, or mouse 126 and receiving output from the system via 
speaker 128 and display 138. Additionally, an operating 
system such as AIX ("AIX" is a trademark of the IBM 
Corporation) is used to coordinate the functions of the 
various components shown in FIG. 1. 

It should be noted that the data processing system con- 
figured in accordance with the present invention may be a 
multi-processing system including processors 101 and 102, 
in addition to processor 210, coupled to system bus 211. 

With reference now to FIG. 2, there is depicted a block 
diagram of an illustrative embodiment of a data processing 
system for processing information in accordance with the 
invention recited within the appended claims. In the 
depicted illustrative embodiment, CPU 210 comprises a 
single integrated circuit superscalar microprocessor. 
Accordingly, as discussed further below, CPU 210 includes 
various execution units, registers, buffers, memories, and 
other functional units, which are all formed by integrated 
circuitry. As illustrated in FIG. 2, CPU 210 is coupled to 
system bus 211 via bus interface unit (BIU) 212 and pro- 
cessor bus 213, which like system bus 211 includes address, 
data, and control buses. BIU 212 controls the transfer of 
information between processor 210 and other devices
coupled to system bus 211, such as main memory (RAM)
250 and nonvolatile mass storage 252, by participating in
bus arbitration. The data processing system illustrated in
FIG. 2 may include other unillustrated devices coupled to
system bus 211, which are not necessary for an understanding
of the following description and are accordingly omitted
for the sake of simplicity.

BIU 212 is connected to instruction cache and MMU
(Memory Management Unit) 214 and data cache and MMU
216 within CPU 210. High-speed caches, such as those
within instruction cache and MMU 214 and data cache and
MMU 216, enable CPU 210 to achieve relatively fast access
times to a subset of data or instructions previously transferred
from main memory 250 to the caches, thus improving
the speed of operation of the data processing system. Data
and instructions stored within the data cache and instruction
cache, respectively, are identified and accessed by address
tags, which each comprise a selected number of high-order
bits of the physical address of the data or instructions in
main memory 250. Instruction cache and MMU 214 is
further coupled to sequential fetcher 217, which fetches
instructions for execution from instruction cache and MMU
214 during each cycle. Sequential fetcher 217 transmits
branch instructions fetched from instruction cache and
MMU 214 to branch processing unit ("BPU") 218 for
execution, but temporarily stores sequential instructions
within instruction queue 219 for execution by other execution
circuitry within CPU 210.

In the depicted illustrative embodiment, in addition to
BPU 218, the execution circuitry of CPU 210 comprises
multiple execution units for executing sequential
instructions, including fixed-point-unit ("FXU") 222, load/store
unit ("LSU") 228, and floating-point unit ("FPU") 230.
Each of execution units 222, 228 and 230 typically executes
one or more instructions of a particular type of sequential
instructions during each processor cycle. For example, FXU
222 performs fixed-point mathematical and logical operations
such as addition, subtraction, ANDing, ORing, and
XORing, utilizing source operands received from specified
general purpose registers ("GPRs") 232. Following the
execution of a fixed-point instruction, FXU 222 outputs the
data results of the instruction to GPR buffers 232, which
provide storage for the result received on result bus 262.
Conversely, FPU 230 typically performs single and double-precision
floating-point arithmetic and logical operations,
such as floating-point multiplication and division, on source
operands received from floating-point registers ("FPRs")
236. FPU 230 outputs data resulting from the execution of
floating-point instructions to selected FPR buffers 236,
which store the result data. As its name implies, LSU 228
typically executes floating-point and fixed-point instructions
which either load data from memory (i.e., either the data
cache within data cache and MMU 216 or main memory
250) into selected GPRs 232 or FPRs 236 or which store
data from a selected one of GPRs 232 or FPRs 236 to
memory 250.

CPU 210 employs both pipelining and out-of-order
execution of instructions to further improve the performance
of its superscalar architecture. Accordingly, instructions can
be executed by FXU 222, LSU 228, and FPU 230 in any
order as long as data dependencies are observed. In addition,
instructions are processed by each of FXU 222, LSU 228,
and FPU 230 at a sequence of pipeline stages. As is typical
of high-performance processors, each sequential instruction
is processed at five distinct pipeline stages, namely, fetch,
decode/dispatch, execute, finish, and completion.

During the fetch stage, sequential fetcher 217 retrieves
one or more instructions associated with one or more
memory addresses from instruction cache and MMU 214.
Sequential instructions fetched from instruction cache and
MMU 214 are stored by sequential fetcher 217 within
instruction queue 219. In contrast, sequential fetcher 217
removes (folds out) branch instructions from the instruction
stream and forwards them to BPU 218 for execution. BPU
218 includes a branch prediction mechanism, which in one
embodiment comprises a dynamic prediction mechanism
such as a branch history table, that enables BPU 218 to
speculatively execute unresolved conditional branch instructions
by predicting whether or not the branch will be taken.

During the decode/dispatch stage, dispatch unit 220
decodes and dispatches one or more instructions from
instruction queue 219 to execution units 222, 228, and 230,
typically in program order. In a more conventional
processor, dispatch unit 220 allocates a rename buffer within
GPR rename buffers 233 or FPR rename buffers 237 for each
dispatched instruction's result data, and at dispatch, instructions
are also stored within the multiple-slot completion
buffer of completion unit 240 to await completion. However,
the present invention is adaptable to embodiments which
require neither rename registers nor completion units.
According to the depicted illustrative embodiment, CPU 210
tracks the program order of the dispatched instructions
during out-of-order execution utilizing unique instruction
identifiers.

During the execute stage, execution units 222, 228, and
230 execute instructions received from dispatch unit 220
opportunistically as operands and execution resources for
the indicated operations become available. In one
embodiment, each of execution units 222, 228, and 230 is
equipped with a reservation station that stores instructions
dispatched to that execution unit until operands or execution
resources become available. After execution of an instruction
has terminated, execution units 222, 228, and 230 store
data results, if any, within either GPRs or FPRs, depending
upon the instruction type. In more conventional processors,
execution units 222, 228, and 230 notify completion unit
240 which instructions have finished execution. Finally,
instructions are completed in program order out of the
completion buffer of completion unit 240. Instructions
executed by FXU 222 and FPU 230 are completed by
transferring data results of the instructions from GPR
rename buffers 233 and FPR rename buffers 237 to GPRs
232 and FPRs 236, respectively.

However, in various embodiments, the invention utilizes
the dispatch logic of the processor to "tokenize" a classical
Von Neumann instruction stream into a data flow-style
format. Thus, data dependencies are not handled by tracking
the storage location of source data required by each
instruction, as in register renaming, but rather by associating
with an instruction certain information which enables tracking
source data by reference to another instruction which is
to provide the source data. Accordingly, the processor is
provided with a target identification ("TID") generator
which generates tokens, or tags, each of which is uniquely
associated with an instruction upon dispatch. The TIDs are
used to retain program order information and track data
dependencies.

The dispatch unit 220 in the present invention not only
assigns TIDs and dispatches instructions, but also updates
various tables which are used to track the status of the
dispatched instructions.

The CPU 210 supports out-of-order speculative instruction
execution. Instructions may be speculative on a predicted
branch direction or speculative beyond an instruction
that may cause an interrupt condition. In the event of a 
branch misprediction or an interrupt, hardware automatically
flushes undesired instructions from the pipelines and
discards undesired results, presenting the effect of precise 
exceptions and sequentially executed instructions down the 
appropriate branch paths. Incorrect speculative results are 
selectively flushed from all units in one clock cycle, and 
instruction dispatch can resume the following clock cycle. 
One group identifier tag ("GID") is assigned per set of 
instructions bounded by outstanding branch or interruptible 
instructions. 

This invention will be described in terms of an imple-
mentation that includes multiple load units and a single store 
unit. However, it should be clear to one skilled in the art that 
this invention could be modified to handle other configura- 
tions such as a single load/store unit, etc. The proposed 
invention allows loads to execute out of order with respect 
to other loads and stores and it allows stores to execute out 
of order with respect to all loads. 

As described above, all instructions are tagged in such a 
manner that relative age between any two instructions can be 
easily determined. The mechanism that will be assumed for 
this description is that of monotonically increasing values 
(TID). The TID value of each instruction is associated with 
queue entries and pipeline stages in which it resides. 

This TID-based approach allows hardware to implement 
an instruction flush mechanism (to respond to a processor- 
generated flush command) by performing a magnitude com- 
parison between the TID associated with the flush command 
and the TID associated with a particular queue entry or
functional unit stage and invalidating the entry if it is for an 
instruction which is as young or younger than the flushed 
instruction. All remnants of the flushed instruction (and all 
subsequent instructions) are "flushed" from the machine and 
the fetch unit is redirected to fetch starting at the address of 
the "flushed" instruction. 
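
The magnitude comparison used by the flush mechanism can be sketched as below. This is a minimal illustration, assuming TIDs increase monotonically without wrapping; the `QueueEntry` class and `apply_flush` function are hypothetical names, not structures named in this description.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueueEntry:
    tid: int            # age tag; a larger value means a younger instruction
    valid: bool = True


def apply_flush(entries: List[QueueEntry], flush_tid: int) -> None:
    """Invalidate each entry as young as or younger than the flushed instruction."""
    for entry in entries:
        if entry.valid and entry.tid >= flush_tid:
            entry.valid = False


queue = [QueueEntry(tid=t) for t in (3, 7, 8, 12)]
apply_flush(queue, flush_tid=8)
surviving = [e.tid for e in queue if e.valid]
```

Because every queue and pipeline stage can apply the same comparison independently, no central list of in-flight instructions has to be walked to carry out a flush.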

Refer next to FIG. 3, where there is illustrated further 
detail of load/store unit 228 coupled to instruction queue 219 
and instruction cache 214. Also illustrated is floating point 
unit 230; however, floating point unit 230 is not a subject of 
this invention. FIG. 3 illustrates the basic functional units
and instruction queues. The functional units are cluster A 
307, cluster B 308, and store unit 302. This invention centers 
around three queues and the interlocks between both these 
queues and the load and store units. The three queues are: 

store address queue 303, 

"preload" queue 309, and 

"load-hit-load" queue 315. 

Entries in each of these queues typically include the TID 
(or age indicator) of the instruction associated with the entry, 
the operand address, and the operand byte count. This 
information allows relative age determination between an 
entry and any other storage reference, as well as allows 
overlap detection, down to the byte level if desired. 
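
A queue entry of this kind, together with the age and overlap tests it enables, might look like the following sketch. The `StorageRef` name, its field names, and the convention that a smaller TID is logically earlier are assumptions made for illustration.

```python
from typing import NamedTuple


class StorageRef(NamedTuple):
    tid: int       # age indicator; smaller is logically earlier (assumed)
    address: int   # operand address of the first byte
    nbytes: int    # operand byte count


def bytes_overlap(a: StorageRef, b: StorageRef) -> bool:
    """True if the two references touch at least one byte in common."""
    return a.address < b.address + b.nbytes and b.address < a.address + a.nbytes


def is_older(a: StorageRef, b: StorageRef) -> bool:
    """True if a is logically prior to b in program order."""
    return a.tid < b.tid


store = StorageRef(tid=10, address=0x1000, nbytes=8)
load_hit = StorageRef(tid=14, address=0x1004, nbytes=4)
load_miss = StorageRef(tid=15, address=0x1008, nbytes=4)
```

In this example the store covers bytes 0x1000 through 0x1007, so it overlaps `load_hit` but not `load_miss`, which begins at the next byte.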

In one embodiment, "below" dispatch and "above" the 
load and store units are two instruction queues: all dis- 
patched loads are queued in the "PEQ" 306 while waiting to 
execute in a load unit, all stores are queued in the "SPQ" 301 
while waiting for the store unit 302. At the start of each 
cycle, hardware determines which store is the oldest dis- 
patched store that has not yet translated, if any such stores 
exist. For the instruction queue structure described above, 
this consists of examining the store unit (or units) for any 
untranslated stores. If any exist, the oldest one is deemed the 
"oldest untranslated store." If none exist, the SPQ 301 is 
examined to find the oldest untranslated store. If such a store
is found, it is deemed as the "oldest untranslated store." If 
none are found, the "oldest untranslated store" pointer 
defaults to the next instruction to be dispatched. 
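
The selection just described can be sketched as a three-step fallback. The function and parameter names are hypothetical, and the sketch assumes that a smaller TID denotes an older instruction.

```python
from typing import Iterable


def oldest_untranslated_store(store_unit_tids: Iterable[int],
                              spq_tids: Iterable[int],
                              next_dispatch_tid: int) -> int:
    """Return the TID used as the "oldest untranslated store" pointer.

    store_unit_tids: TIDs of untranslated stores in the store unit (or units)
    spq_tids:        TIDs of untranslated stores waiting in the SPQ
    """
    in_unit = list(store_unit_tids)
    if in_unit:                      # the store unit is examined first
        return min(in_unit)
    queued = list(spq_tids)
    if queued:                       # otherwise fall back to the SPQ
        return min(queued)
    return next_dispatch_tid         # no untranslated store exists
```

Defaulting to the next instruction to be dispatched means that, when no untranslated store exists, no load currently in execute is treated as following an untranslated store.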

The store address queue 303 is a FIFO list of all stores that
have translated, but whose associated data has not yet been
written to the L1 cache 310, 311. Entries are created as a
result of the translation of store instructions at execute;
entries are removed as a result of writing the associated data
to the L1 cache 310, 311. Associated with the store address
queue 303 is the store data queue 304. As stores are
dispatched, entries are allocated in the store data queue 304.
If the store data is available as the entry is allocated, the data
is placed in the store data queue 304. Otherwise, as the data
is generated by the functional units, the store data queue 304
will snoop the result buses and capture the data in the store
data queue 304. Like the store address queue 303, entries are
removed as bytes are written to the L1 cache 310, 311.

The store data queue 304 and the store address queue 303 
are coupled to the store data queue processing unit 305, 
which is coupled to the load miss queue 312, which is
coupled to the L2 cache arbitration logic 313. Further 
description of these units is not presented, since such a 
description is not necessary for describing the present invention.
Please note that other functional blocks may be implemented
within load/store execution unit 228, but have not
been shown for reasons of simplicity and clarity. 

If both the store execution unit 302 and SPQ 301 were
examined concurrently and with equal weight, then this
invention is extendable to the case where stores are executed
out of order with respect to other stores. In this description,
it is assumed that stores execute in order; therefore, the
execution unit 302 is examined first and with higher priority
for establishing a store as the "oldest untranslated store."
In-order execution of stores also implies that the store
address queue 303 can be managed as a first-in-first-out
(FIFO) queue while avoiding deadlock concerns stemming
from store address queue space.

The preload queue 309 is specific to this invention and
holds the addresses of all translated loads which logically
follow the "oldest untranslated store." At the start of each
cycle, it is determined whether any loads executing in the
load unit are logically subsequent instructions to the "oldest
untranslated store." If they are, then they are considered
"preloads" and require an entry in the preload queue 309 to
execute. If no room exists in the preload queue 309 and an
entry is needed, one of two actions results:

If the load in execute is younger than (logically subsequent
to) all loads in the preload queue 309, then this
load (and all subsequent instructions) is flushed from
the machine 210 and the instruction fetch unit 217 is
redirected to begin fetching at the address of the flushed
load instruction.

If an entry in the preload queue 309 is younger than the
load in execute that requires a preload queue entry, then
the youngest load in the preload queue 309 (and
subsequent instructions) is flushed and re-fetched and
the load in execute is given the flushed load's entry in
the preload queue 309.

For implementations that allow more than one load in
execute to require a preload queue entry in the same cycle,
the above is modified in a straightforward manner, namely
the results are as if the loads are processed by the above
rules, one load at a time, starting with the oldest load. For
example, if two loads in execute each require a preload
queue entry and only one entry exists, then the oldest load
in execute gets the available entry and then the youngest
load in execute follows the rules above, for a full queue 309. 
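
The full-queue rules above can be illustrated with the sketch below, which handles one load at a time. The function name, the representation of the queue as a plain list of TIDs, and the `capacity` parameter are assumptions for this example; a larger TID is taken to mean a younger load.

```python
from typing import List, Optional, Tuple


def allocate_preload_entry(queue: List[int], capacity: int,
                           load_tid: int) -> Tuple[List[int], Optional[int]]:
    """Try to give the load in execute an entry in the preload queue.

    Returns the new queue contents and the TID from which to flush and
    re-fetch, or None when no flush is needed.
    """
    if len(queue) < capacity:
        return queue + [load_tid], None
    youngest = max(queue)
    if load_tid > youngest:
        # The load in execute is younger than every queued load: flush it.
        return list(queue), load_tid
    # A queued load is younger: flush the youngest and reuse its entry.
    remaining = [tid for tid in queue if tid != youngest]
    return remaining + [load_tid], youngest
```

Either way the youngest contender is the one flushed, so the oldest preloads keep their entries. When several loads need entries in one cycle, calling this routine on each load in oldest-first order follows the one-at-a-time rule described above.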
At the end of each cycle, valid entries in the preload queue
309 are compared to the "oldest untranslated store" age; any 
entries which are older than (logically prior to) the "oldest
untranslated store" are invalidated (discarded). Preload 
queue entries can also be invalidated as a result of a flush
command if the preload queue entry is for a load instruction 
which is the subject (or younger) instruction of a flush 
command. 
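
The end-of-cycle comparison might be sketched as a simple filter. The function name and the list-of-TIDs representation are hypothetical, and a smaller TID is assumed to mean an older instruction.

```python
from typing import List


def retire_preloads(preload_tids: List[int],
                    oldest_untranslated_store_tid: int) -> List[int]:
    """Keep only loads that still logically follow the oldest untranslated store."""
    return [tid for tid in preload_tids if tid > oldest_untranslated_store_tid]


remaining = retire_preloads([5, 9, 14], oldest_untranslated_store_tid=8)
```

An entry that is older than every untranslated store can no longer be hit by a store whose address is still unknown, so it no longer needs to be tracked as a preload.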

The store address queue 303 contains the addresses of 
stores that have been translated but have not yet written their 
data to the cache 310, 311. In addition to the purposes used 
by this invention, this queue 303 allows stores to be trans- 
lated and exceptions detected without waiting for the store 
data. De-coupling these two portions of a store instruction is 
key to de-coupling the fixed-point portion of the processor 
210 (which usually does the address generation/translation 
for storage references) from the floating-point portion 230 
(which generates/normalizes floating-point data). Several 
current designs include such store address queues 303. As in 
most existing implementations, the store address queue 303 
is managed in a FIFO manner and the oldest entry in the 
store address queue 303 is the next entry to write to the cache 
310, 311. It should be clear to one skilled in the art that 
entries other than the oldest entry could be written to the 
cache 310, 311, allowing younger stores with data to write 
ahead of older stores waiting on data. It should also be clear 
that the invention as described would not require modifica- 
tion to handle this improvement. 

Store address queue entries are invalidated (discarded) 
under two conditions: 

The associated store operation has been performed to the
cache 310, 311, or
A flush command signals that a store address queue entry 
should be discarded because it is younger than the 
subject of an instruction flush command. 
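
The FIFO management and the two discard conditions for the store address queue can be sketched as follows. This is only an illustration: it assumes integer age tags (smaller means logically earlier), stores translated in program order, and a cache modeled as a plain dictionary keyed by address. The class StoreAddressQueue and its method names are hypothetical.

```python
from collections import deque


class StoreAddressQueue:
    def __init__(self):
        self.entries = deque()  # oldest translated store sits at the left

    def add_translated_store(self, age, addr, nbytes):
        # An entry is created at translation, before the store data is available.
        self.entries.append({"age": age, "addr": addr, "nbytes": nbytes, "data": None})

    def supply_data(self, age, data):
        for entry in self.entries:
            if entry["age"] == age:
                entry["data"] = data

    def write_oldest(self, cache):
        """Write the oldest entry to the cache if its data has arrived (FIFO)."""
        if self.entries and self.entries[0]["data"] is not None:
            entry = self.entries.popleft()   # discard condition 1: store performed
            cache[entry["addr"]] = entry["data"]
            return entry["age"]
        return None

    def flush(self, subject_age):
        """Discard condition 2: drop stores younger than the flush subject."""
        self.entries = deque(e for e in self.entries if e["age"] <= subject_age)
```

With strict FIFO ordering, a younger store whose data is ready still waits behind an older store that is waiting on data, which is the behavior the text describes for most existing implementations.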

The load-hit-load queue 315 is specific to this invention
and holds the addresses of all translated loads that logically
follow the oldest untranslated load or store (see step 601 of
FIG. 6). At the start of each cycle, it is determined whether
any loads executing in the load unit are logically subsequent
instructions to the oldest untranslated load or store. If they
are, then they require an entry in the load-hit-load queue 315
to execute. If no room exists in the load-hit-load queue 315
and an entry is needed, one of two actions results:

If the load in execute is younger than (logically subsequent
to) all loads in the load-hit-load queue 315, then
this load (and all subsequent instructions) is flushed
from the machine 210 and the instruction fetch unit 217
is redirected to begin fetching at the address of the
flushed load instruction.

If an entry in the load-hit-load queue 315 is younger than
the load in execute which requires a load-hit-load queue
entry, then the youngest load in the load-hit-load queue
315 (and subsequent instructions) is flushed and
re-fetched and the load in execute is given the flushed
load's entry in the load-hit-load queue 315.
For implementations that allow more than one load in 
execute to require a load-hit-load queue entry in the same 
cycle, the above is modified in a straightforward manner, 
namely the results are as if the loads are processed by the 
above rules, one load at a time, starting with the oldest load. 
For example, if two loads in execute each require a load- 
hit-load queue entry and only one entry exists, then the 
oldest load in execute gets the available entry and then the 
youngest load in execute follows the rules above for a full 
queue 315. 
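
The allocation rules above for a full load-hit-load queue can be sketched as a small function. The sketch assumes distinct integer age tags (smaller means logically earlier) and represents the queue only by the ages of its entries; allocate_lhl_entry and allocate_many are hypothetical names chosen for illustration.

```python
def allocate_lhl_entry(queue_ages, capacity, load_age):
    """Return (new queue ages, age of the load to flush and re-fetch, or None)."""
    if len(queue_ages) < capacity:
        return sorted(queue_ages + [load_age]), None
    youngest = max(queue_ages)
    if load_age > youngest:
        # The load in execute is younger than every queued load: flush it.
        return list(queue_ages), load_age
    # Some queued load is younger: flush the youngest queued load and
    # give its entry to the load in execute.
    remaining = [age for age in queue_ages if age != youngest]
    return sorted(remaining + [load_age]), youngest


def allocate_many(queue_ages, capacity, load_ages):
    """Apply the single-load rule one load at a time, oldest load first."""
    flushed = []
    for age in sorted(load_ages):
        queue_ages, victim = allocate_lhl_entry(queue_ages, capacity, age)
        if victim is not None:
            flushed.append(victim)
    return queue_ages, flushed
```

For the two-load example in the text, a queue with one free entry gives that entry to the older load, and the younger load is then handled by the full-queue rule. Because a flush also removes all subsequent instructions, only the oldest age in the returned flush list matters as the re-fetch point.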

At the end of each cycle, valid entries in the load-hit-load
queue 315 are compared to the oldest untranslated load or
store age; any entries which are older than (logically prior
to) the oldest untranslated load and oldest untranslated store
are invalidated (discarded). Load-hit-load queue entries can
also be invalidated as a result of a flush command if the
load-hit-load queue entry is for a load instruction which is
the subject (or younger) instruction of a flush command.
Note that the preload queue 309 and load-hit-load queue
315 contain similar information and, in some
implementations, could be merged into a single structure.

The above description details the conditions under which
entries are created and discarded in the three primary queues
for this invention: the preload queue 309, the store address
queue 303, and the load-hit-load queue 315. This next
section details the address checks which are performed
between queue entries to satisfy the architectural storage
consistency requirements described previously herein.

Referring next to FIG. 4, to detect the occurrence of a load
that executed ahead of a store, but actually requires data
from the store, the mechanism involves a comparison (step
401) between the store currently in execute (and being
translated) "this cycle" and loads which are in the preload
queue 309. These preload queue entries represent all loads
that executed earlier and which could possibly have an
address conflict with the current store. If it is determined that
there are some bytes in common given the address and byte
count for the store and the address and byte count for a
preload entry (step 402), and that the preload queue entry is
for a load that is logically subsequent to the store (step 403),
then the data returned for the load should include data
generated by the store. However, the load may have
accessed a "stale" copy of the data from the cache 310, 311
prior to being updated by the contents of the current store.
In this case, a flush command is generated to flush the
offending load (and all subsequent instructions) from the
machine and the fetch mechanism is directed to fetch
starting at the address of the flushed load (step 404).
Otherwise, in step 405, the process proceeds normally.
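
The FIG. 4 check can be sketched in a few lines. The sketch assumes integer age tags (smaller means logically earlier) and dictionary records with hypothetical keys age, addr, and nbytes. Returning the oldest offending load is an assumption made here: flushing that load also removes every subsequent instruction, so it covers any younger offenders.

```python
def bytes_overlap(addr_a, len_a, addr_b, len_b):
    """True if the two byte ranges share at least one byte."""
    return addr_a < addr_b + len_b and addr_b < addr_a + len_a


def check_store_against_preload(store, preload_queue):
    """Return the age of the oldest load that must be flushed, or None."""
    offenders = [
        load["age"]
        for load in preload_queue                                # step 401
        if bytes_overlap(store["addr"], store["nbytes"],
                         load["addr"], load["nbytes"])           # step 402
        and load["age"] > store["age"]                           # step 403
    ]
    return min(offenders) if offenders else None                 # step 404 / 405


store = {"age": 5, "addr": 0x100, "nbytes": 4}
preload_queue = [
    {"age": 8, "addr": 0x102, "nbytes": 4},  # overlaps, younger than the store
    {"age": 6, "addr": 0x103, "nbytes": 1},  # overlaps, younger than the store
    {"age": 3, "addr": 0x100, "nbytes": 4},  # overlaps, but logically prior
    {"age": 9, "addr": 0x104, "nbytes": 4},  # younger, but no common bytes
]
flush_from = check_store_against_preload(store, preload_queue)
```

Here the flush starts at the load with age 6, the logically earliest load that may hold stale data.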

Referring next to FIG. 5, another example to consider is
where

a load requires data from a store,
the load executes after the store executed, and
the load operation got data from the cache 310, 311 while
the store operation was still queued in the store address
queue 303 (i.e., prior to the store operation updating the
cache 310, 311).

This example can be detected by comparing each load in
execute to each store address queue entry (step 501). If a
load/store comparison pair indicates that the load is logically
later than the store (step 503), the load requires bytes from
the store (step 502), and the load got data prior to the store
updating the cache, then a flush command is generated to
flush the offending load (and all subsequent instructions)
from the machine 210, and the fetch mechanism is directed
to fetch starting at the address of the load (step 504). (A
similar check is performed in existing in-order machines
which implement a store address queue; however, they hold
the load in execute and re-access the cache once the store
operation updates the cache.)
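
The FIG. 5 comparison can be sketched as follows. The sketch assumes integer age tags (smaller means logically earlier) and treats every entry still present in the store address queue as a store that has not yet updated the cache. MemOp and load_hits_queued_store are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class MemOp:
    age: int     # hypothetical program-order tag; smaller = logically earlier
    addr: int    # starting byte address
    nbytes: int  # byte count


def bytes_overlap(a, b):
    """True if the two operations touch at least one byte in common."""
    return a.addr < b.addr + b.nbytes and b.addr < a.addr + a.nbytes


def load_hits_queued_store(load, store_address_queue):
    """True if the load in execute must be flushed and re-fetched."""
    for store in store_address_queue:        # step 501
        if (bytes_overlap(load, store)       # step 502
                and load.age > store.age):   # step 503
            return True                      # step 504
    return False


pending_stores = [MemOp(2, 0x40, 8), MemOp(6, 0x80, 4)]
must_flush = load_hits_queued_store(MemOp(5, 0x44, 4), pending_stores)
```

In the usage above, the load with age 5 reads bytes that the older queued store with age 2 has not yet written to the cache, so it is flagged for a flush.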

Assuming that preload queue entries and store address
queue entries are created essentially at the end of the execute
cycle, one final set of checks is required to handle the case
for a load and store that execute during the same cycle. A
straightforward solution is to construct the logic (which
checks loads in execute against the store address queue 303)
so that a store at execute appears logically as one extra store
address queue entry.

To allow out-of-order stores with respect to loads, while
preventing stores from destroying data which might be
required by a logically earlier load, stores are prohibited
from updating the cache 310, 311 until all prior interruptible
instructions are known not to generate an exception. (This is
an existing necessary condition in most processors since
detecting an interrupt exception for a logically earlier
instruction after performing a store operation would require
additional complexity to restore the cache 310, 311 to the
state as expected at the interrupt point.) Assuming the cache
path access time is the same (or longer) for a store as for a
load, and that the load cache access is performed in parallel
with the load translation check (which is done prior to
posting the load's interrupt exception status), then the above
interlock will allow a load to access the cache 310, 311 and
get the "old" data prior to the location being updated by any
logically subsequent store. This part of the mechanism
requires no address comparison. If a load generates a cache
miss, then it is known that the cache miss condition also
exists for any store operation for which bytes overlap with
the load. The correct data consistency can be enforced in
this case by ensuring that logically earlier loads get data
from cache misses prior to stores updating the same cache
line.
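
The store interlock described above reduces to an age comparison with no address check. A minimal sketch, assuming integer age tags (smaller means logically earlier) and a hypothetical list of the ages of interruptible instructions whose exception status is not yet known:

```python
def store_may_update_cache(store_age, unresolved_interruptible_ages):
    """A store may write the cache only when no logically prior
    interruptible instruction still has an unknown exception status."""
    return all(age > store_age for age in unresolved_interruptible_ages)


# A load with age 8 has not yet posted its exception status,
# so the store with age 10 must wait; once it has, the store may proceed.
blocked = store_may_update_cache(10, [8, 12])
released = store_may_update_cache(10, [12])
```

Since a load posts its exception status only after its translation check, and its cache access runs in parallel with that check, holding the store back in this way lets the earlier load read the old data first.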

Referring next to FIG. 6, the third storage architecture
requirement described above is that (for a pair of loads
which access at least one byte in common) loads return data
consistent with executing the loads in order. Consider an MP
system (see FIG. 1) where processor P1 210 is executing a
sequence of loads to real address RA while processor P2 101
performs a store to the same address RA. If P1 210 executes
the two loads L1A and L2A out of order, then L2A may get
the RA value before P2 101 stores while L1A could get the
RA value after the P2 store. This could cause incorrect
program behavior on P1 210 based on the standard program
model.
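
The anomaly in this example can be reproduced with a toy trace. The sketch assumes that L1A is the logically earlier of the two loads and models memory as a dictionary with the placeholder values "old" and "new"; all names are illustrative only.

```python
def run_out_of_order_loads():
    """Replay the interleaving described in the text and return what each load saw."""
    memory = {"RA": "old"}
    observed = {}
    observed["L2A"] = memory["RA"]  # younger load on P1 executes first
    memory["RA"] = "new"            # store from P2 becomes visible
    observed["L1A"] = memory["RA"]  # older load on P1 executes last
    return observed


def violates_load_load_order(observed):
    """In program order L1A precedes L2A, so L2A must not see older data than L1A."""
    return observed["L1A"] == "new" and observed["L2A"] == "old"


trace = run_out_of_order_loads()
```

The trace shows the older load returning the new value while the younger load returns the old one, which no in-order execution of the two loads could produce.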

The goal of the load-hit-load queue 315 is to detect such
conditions and force the older load to return data equal
to or older than the data returned by the younger load. One
possible solution would detect a store from another
processor that may (or does) fall between two out-of-order
loads. This requires address comparators and the ability to
recover when this event is detected in an out-of-order
machine, and may require significantly more queue space to
hold state required for a potential recovery. The solution of
the present invention is to enforce age-based ordering
between loads that have executed and to provide recovery
for those cases where a load that is executing is already in
possible violation of the load-load ordering rules.

Specifically, load-load ordering can be determined at the
time a load executes. If a load in execute does not match an
entry in the load-hit-load queue, the load progresses
normally. If the load does match (at least a one-byte overlap) a
load-hit-load queue entry (step 602) and the load in execute
is younger (step 603) or the load-hit-load queue element is
not beyond the point where the loads can be reordered (step
604), the loads are reordered (change or mark instructions in
the load-hit-load queue 315 such that the oldest load
receives its data first) and then progress normally (step 606).
If the load in execute matches a load-hit-load queue element
(step 602), the load in execute is older than the load-hit-load
queue element (step 603) and the load-hit-load queue
element is beyond the point where the loads could be reordered
(step 604), the younger load (and all subsequent
instructions) are flushed from the processor and the fetch
mechanism is directed to fetch starting at the address of the
flushed load (step 605).
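
The FIG. 6 decision for one load in execute against one load-hit-load queue entry can be sketched as a three-way choice. The sketch assumes integer age tags (smaller means logically earlier) and a hypothetical per-entry flag past_reorder_point that marks an entry as beyond the point where reordering is possible. The returned strings are labels chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class LoadRef:
    age: int                          # hypothetical tag; smaller = logically earlier
    addr: int                         # starting byte address
    nbytes: int                       # byte count
    past_reorder_point: bool = False  # meaningful for queue entries only


def bytes_overlap(a, b):
    """True if the two loads touch at least one byte in common."""
    return a.addr < b.addr + b.nbytes and b.addr < a.addr + a.nbytes


def load_hit_load_action(executing, entry):
    """Choose the action for a load in execute versus one queue entry."""
    if not bytes_overlap(executing, entry):                   # step 602
        return "proceed"
    executing_is_younger = executing.age > entry.age          # step 603
    if executing_is_younger or not entry.past_reorder_point:  # step 604
        return "reorder_then_proceed"                         # step 606
    return "flush_younger_load"                               # step 605
```

Only the last case costs a flush: an older load arrives at execute after a younger overlapping load has already passed the point where its data could still be reordered.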

Note that in all of the above cases where a flush command 
is generated, it is generated for the offending load. 

Although the present invention and its advantages have
been described in detail, it should be understood that various
changes, substitutions and alterations can be made herein
without departing from the spirit and scope of the invention
as defined by the appended claims.

What is claimed is:

1. A multi-processor system comprising: 

a first processor;

a second processor coupled to the first processor and a
memory system via a bus system;
wherein the first and second processors each comprise: 
a store execution unit; 
a load execution unit; 

an issue unit operable for issuing load and store instruc- 
tions to the load execution unit and the store execution 
unit, respectively, in an out-of-order sequence; 

circuitry for comparing a first store instruction being 
executed with first load instructions in a preload queue; 

circuitry for determining whether there are any common 
bytes between the first store instruction and any of the 
first load instructions in the preload queue; 

if there are any common bytes between the first store 
instruction and any first load instructions in the preload 
queue, then circuitry for determining if one of the first 
load instructions in the preload queue having any 
common bytes with the first store instruction is logi- 
cally subsequent to the first store instruction; 

if one of the first load instructions is logically subsequent 
to the first store instruction, circuitry for flushing one of 
the first load instructions and all subsequent instruc- 
tions; 

circuitry for comparing a second load instruction being 
executed with second store instructions in a store 
address queue; 

circuitry for determining if there are any common bytes 
between the second load instruction and any of the 
second store instructions in the store address queue; 

if there are any common bytes between the second load 
instruction and any of the second store instructions in 
the store address queue, circuitry for determining if the 
second load instruction is logically subsequent to any 
of the second store instructions; 

if the second load instruction is logically subsequent to 
any of the second store instructions, circuitry for flush- 
ing the second load instruction and all subsequent 
instructions; 

circuitry for entering third load instructions into a load hit 
load queue if the loaded instructions have been trans- 
lated and are logically subsequent to an oldest untrans- 
lated load or store instruction; 

circuitry for determining if there are any common bytes 
between a third load instruction being executed and any 
load instruction in the load hit load queue; 

if there are any common bytes between the third load 
instruction being executed and any load instruction in 
the load hit load queue, circuitry for determining if the 
third load instruction being executed is logically older 
than the load instruction in the load hit load queue; 

if the third load instruction being executed is logically 
older than the load instruction in the load hit load 
queue, circuitry for determining if the third load 
instruction in the load hit load queue is beyond the
point where the load instructions in the load hit load 
queue can be reordered; and 

if the load instruction in the load hit load queue is beyond 
the point where the load instructions in the load hit load 
queue can be reordered, circuitry for flushing the logi- 
cally younger load instruction and all subsequent 
instructions. 

* * * * *